The amount of publicly available information resources on the Web is increasing rapidly. As this trend 
continues, finding these resources becomes more difficult. Several systems, notably Archie, Jumpstation, 
Lycos, WebCrawler, RBSE Index, and Harvest, have attempted to solve this problem by building indices and 
allowing users to search them. 


These Web-wide indexing solutions do not, in practice, share information with each other. In addition, the 
databases often fail to provide users with specific or complete answers to their queries. 


This workshop has two goals: to find ways that index providers can share indexing and Web-structure 
information, and to explore ways to improve the query experience for users. Both goals stem from the need to 
address the increasing scale of the Internet: as the size of the problem increases, we need to be more efficient at 
building indices, and better at focusing on what the users are looking for. Solving these 
problems will enable us to build more efficient, consistent, and powerful indices of the World-Wide Web. 


In addition to a general discussion of Web-wide indexing, the workshop will have two specific tasks: 


1. to examine some of the existing protocols for sharing information to see what we can use and what we 
need to build, and 
2. to envision an operational plan for putting these tools to use on an experimental basis. 


Allowing indices to cooperate and exchange information can help solve several problems. First, it would allow 
the index builders to do a more efficient job of indexing. Indexing information for Europe might be collected 
in Europe, then transmitted in bulk to the United States. Or, we may find ways to build a distributed index, and 
avoid even the bulk transmission. Second, it would enable different retrieval engines to run against the same set 
of indexing information, providing better service for users and opportunities for research on different kinds of 
retrieval. Finally, it would serve as an experimental tool for learning about decentralized indexing. 


This workshop is not meant to set any standards for indexing or exchange of indexing information. However, 
it may serve as a starting point for the experimentation that will give us the experience necessary to propose 
standards in the future, should that be desirable. 


The workshop environment is the ideal place to do this work; we will bring together people with experience 
building and running Web-wide indices. The ideal participant would be involved in building or operating an 
Internet information discovery system, or be an expert in database systems, distributed computing, 
expert systems, information retrieval, or library studies. 
Summary 


What to index? 


• Use the anchor term used for the HTML links, 
• Use the title and headings of the HTML page, 
• Use the full text to create an index, 
• Use the filename of the HTML resource, 
• Word occurrence - URL pairs, 
• Inverted indices of keywords, 
• Indexes of interesting keywords 
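
As a rough illustration of the "word occurrence - URL pairs" and "inverted indices of keywords" items above, the following Python sketch builds an inverted index from page text and answers simple AND queries. It is not taken from any of the systems discussed; the function names, the tokenizing rule and the sample URLs are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words (an assumed, very simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)  # one (word occurrence, URL) pair
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample pages, for illustration only.
pages = {
    "http://example.org/a.html": "Indexing the Web with robots",
    "http://example.org/b.html": "Library cataloging and indexing",
}
index = build_inverted_index(pages)
print(search(index, "library indexing"))
```

A robot-based indexer would fill `pages` from fetched documents; the same structure could equally be fed from titles, headings or anchor text alone.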


How are indices created? 


• Use a robot to scan the Web for new and changed HTML resources, 
• Server-side support based systems, 
• Different frequency of updates 


What is indexed? 


• Approximately twenty search engines with accompanying services for parts of the WWW, 
• Each covers a part of the Web, 
• Gateways to other indexing services such as WAIS 


What is needed? 


• Establish a Web Indexers’ Working Group, 
• Hierarchical searching, 
• Share information and avoid replication, 
• Parallel, fault-tolerant, and scalable index server, 
• Language (natural) independence, 
• Find all the having concept/property ------ ; 
• Index other forms of resources: images, graphics or sound, 
• Capture structure of the Web, hyper-media, 
• Common interface, 
• Index generation with revision control 
Abstract 


This paper describes an indexing system, called the semantic header, for Internet resources. 


The semantic header contains the meta-information for each "publicly" accessible 
resource on the Internet. The paper also describes the registering system and the distributed 
database representing the union catalog of resources on the Internet. This database 
would be used in a search system to facilitate search. 


Introduction 


Most research institutes, universities and business organizations now 
interconnect their computing facilities using a digital network; this has become the 
accepted method of sharing resources. Such networks, in turn, are interconnected, 
allowing information to be exchanged across networks using a common interchange 
protocol (viz. TCP/IP). The number of such interconnected networks (the Internet) 
continues to grow and, with the emergence of powerful workstation-based servers 
connected to these networks, it is possible to support local as well as remote 
search and retrieval of information stored on any component of the interconnection. 
At this time a number of information sources, both public (free) and private (available 
for a fee), are available on the Internet. They include text, computer programs, books, 
electronic journals, newspapers, organizational, local and national directories of 
various types, sound and voice recordings, images, video clips, scientific data, and 
private information services such as price lists and quotations, databases of products 
and services, and speciality newsletters. 


There is a need for the development of a system which allows easy search for, and 
access to, resources available on the Internet. It has been observed that distributed 
information systems, even when under the control of a single administrative unit, create 
multiple problems typically caused by differences in semantics and representation, 
and by incomplete and incorrect data dictionaries (cataloging) [DESA4]. These problems 
would be magnified manyfold in any distributed information system which tries to 
integrate the resources offered by information systems over the Internet. It is 
also important to avoid problems encountered in a library system where, in spite of 
the fact that the same cataloging system[2] is used, the same item may be 
catalogued or classified differently in two different libraries. 


Such problems could be avoided by starting with a standard index structure and 
building a bibliographic system using standardized control definitions. Such 
definitions could be built into the knowledge base of expert-system-based index 
entry and search interfaces. Furthermore, there must be a mechanism to revise index 
information as the resource changes over time. Finally, annotation of a resource by 
independent users should be allowed. 


The bibliographic entry system should be distributed and accessible to providers as 
well as users of the Internet. In a distributed system such as the Internet, it is natural 
to have the providers of resources prepare and enter the bibliographic information 
about each resource using the standardized index scheme. The index itself should be 
recorded in a distributed database. 
Finally, a search system to help in locating and retrieving appropriate information 
with ease from this database is required. 


Whereas the bibliographic entry and search systems (clients) could be located locally 
at the providers and users of information resources respectively, the bibliographic 
database system (server) should be distributed and replicated at a number of regional 
nodes for enhanced availability and response. The entry and search systems have to be 
supported by an easy-to-use graphical interface for entering the index information 
and accessing it. These systems should incorporate the expertise and knowledge of 
expert cataloguers and reference librarians, with a help system to guide the user at all 
steps. The search system should, in addition, provide appropriate feedback indicating 
the number of hits for each search, and help in providing access to the relevant 
resources. The navigation of database and resource nodes and the protocols and filters 
used would be selected by the system, thus facilitating the task of the user. The 
purpose is to provide uniform access to all resources, as is done in a centralized 
information system through the intermediary of an expert system analyst. The overall 
structure of such a system is given in Figure 1. 


Source of Information and Meta-Information 


Information sources can be classified into three categories[KATZ]: primary, 
secondary and tertiary. Primary information is the original material in the form of 
published or posted articles, monographs, reports, dissertations, programs, images, 
movies, etc. Other primary sources such as personal communications are not usually 
available. Secondary sources, sometimes called meta-information, are used as indices 
to these primary sources of information and are created after a delay which may be a 
few months to a few years. The meta-information is data about the primary source. A 
tertiary source of information is a combination of selected and distilled information 
from primary and secondary sources. 


The purpose of indices and bibliographies (secondary information) is to inventory the 
primary information and allow easy access to it. Preparing a bibliography requires 
finding the primary source, identifying it as to its subject, etc., describing it for later 
matching for unknown future users and classifying it according to accepted norms. 


Since an index is to be used by many users, it has to be accurate, easy to use (usage via 
author, title, subject, etc.), properly classified, up-to-date and complete for its area of 
coverage. In order for a bibliography to be useful, it must fill a real need. The success 
of Archie as a bibliography system (for files available on the Internet via FTP) lies in 
the simple interface it provides to users who are aware of the name of a program or file, or 
of the general nature of a file likely to be distributed from one or more anonymous FTP 
sites. In the case of an on-line bibliography to Internet resources such as the Web, 
the need is for the system to be current within a short period (minutes or at most 
hours) of the posting of a new resource. Compare this with the bibliography system 
for printed publications, which requires weeks or months in the case of the on-line 
databases, longer for the CD version and up to years for the printed version. Even the 
on-line database needs a considerable amount of time before documents are indexed 
in the bibliography. 


The method of compiling a traditional bibliography varies. At one extreme, we have 
scholars spending years of their lives evaluating sources and compiling annotated and 
descriptive entries for each item. The accuracy of this bibliography is high but the 
coverage tends to be limited. At the other extreme, we have the semi-automatic 
mechanism which scans the published works from limited sources (by domain, 
language, or geographic region) and assigns each work to appropriate sub-subject(s). 
Access from multiple headings may be provided. This is desirable because an item 
may deal with more than one topic. The bibliography prepared by the former 
method could be more accurate; it tends, however, to be retrospective rather than 
current. 


The dependence on titles as a search criterion dictates that they must be indicative of 
the contents of the document. This is not always the case; hence someone (the author or 
the cataloger) has to add annotation, keywords or key phrases to indicate the actual 
content. Accuracy or quality of a document can be indicated by including reviewers’ 
opinions. However, such opinions are rarely accessible to the cataloger. Another 
feature of importance to the user of an index is the presence of an accurate abstract. 
An abstract provides a summary of the material and thus is more indicative of the 
contents than the title or the keywords, whether supplied by the author or bibliographer 
or selected by scanning the text. Reference librarians and library users tend to use such 
annotated bibliographies to help choose among competing sources. 


Features such as division of the bibliography by subject and sub-subjects, though of 
concern in the manual systems, should not be apparent in the electronic form. 
However, access through these criteria must be supported. Weeding out bibliography 
entries for Internet resources that are no longer accessible, though attractive, may 
require careful thought from the point of view of completeness. The archiving of resources in 
central libraries could mean that such weeding of the bibliography would not be 
necessary. 


A Cataloging and Searching System 


Library catalogs are prepared by specialists; each entry records the author, 
title, publisher, place of publication, date of publication and other details. The term 
union list, in library lexicons, refers to the catalog which is the union of the 
catalogs of a number of participating libraries. It indicates which item is located 
where. In this sense, the bibliography forms a union list of all sources of documents. 
Since the item in question is not in electronic form, it requires the intermediary of the 
inter-library loan mechanism to borrow it (usually from the nearest location which 
permits the title to be borrowed) or, if possible, to photocopy sections of it. 


Currently a large number of documents exist in addition to the files whose names 
could be searched via systems such as Archie or Xarchie. The popularity of the World 
Wide Web[BERN, BERN3] and browsers such as Mosaic[MOSA] has prompted 
many researchers to start publishing on-line. Attempts to provide easy searching of 
relevant documents have led to a number of systems including WAIS and, more 
recently, a number of Spiders, Worms and other creepy crawlers [DEBR, FLET, 
KOST, MCBR, META, THAU, SEAR, WEBC, WWWW]. 


However, the problem with many of these indices is that their selectivity of 
documents is often poor. The chances of retrieving irrelevant documents and of missing 
relevant information because of a poor choice of search terms are large. In addition, the 
user is required to access the actual resource, based on just the title and author 
information such as is provided through a library catalog, and then decide whether the 
resource meets the needs. 


These problems are addressed in our proposed system by using an appropriate index 
entry called Semantic Header[BCD2] and providing a mechanism to register, manage 
and search the bibliography. The system is an active system requiring the provider of 
information to register the resource by entering an index entry for the resource. Since 
the provider is responsible for preparing the index entry, there is the potential for its 
accuracy to be high. 


The overall system uses knowledge bases and expert sub-systems to help the user in 
the registration and search processes. One need for an expert system is in avoiding the 
chaos introduced by differences in the perceptions of different indexers. Hence, some form 
of standardization of the terms used has to be enforced. We envisage this through the 
intermediary of an expert-system-based engine. The index generation and maintenance 
sub-system uses the knowledge and expertise of the expert cataloguer to help the 
provider of the resource select correct terms for items such as subject, sub-subject and 
keywords. Similarly, another expert system is used in the search sub-system to help 
the user in the search for appropriate information resources. The third component of 
the system is a distributed and replicated database of the bibliography of resources 
available on-line. The database is in the background and the users are not aware of its 
presence, much less of its distributed and replicated nature. These components are 
described below. 


Semantic Header 


The heart of any bibliography or indexing system is the record that is kept for each 
item that is being indexed. Standardization of a bibliographic entry allows libraries to 
exchange information about their collections. A number of projects in the Library 
domain have addressed the problem of cataloging and in particular cataloging of 
information in electronic and multi-media format. CORE[CROM], MARC 
system[BRYN, CRAW, MARC, PETE], MLC[HORN, ROSS, RHEE] and 
TEI[GAYN, GIOR] are examples of some of these initiatives. These existing and 
proposed indexing systems range from a minimum to full level of bibliographic 
information. However, such systems are designed for professional catalogers and 
many of the items included in them, though useful, are beyond the comprehension of 
most providers or users of information. 


We have proposed a simple index structure called Semantic Header [DESA2] for 
resources accessible directly on the Internet. The structure of the index is similar to 
those used for most library indices and includes other information deemed useful 
for on-line systems. The syntax of the semantic header is that of the HTML markup 
language[BERN2], which is based on the SGML markup language. However, the user 
working with the index entry system is guided through the process by an expert 
system. This system guides the user in the choice of standardized terms through an 
easy to use graphical interface. 


We give, in Figure 2 below, the structure of the Semantic Header. An example of use 
of the semantic header is given in Figure 3. The intent of the semantic header is to 
include those items that are most often used in the search of an information resource. 
Since the majority of searches begin with a title, the name of one of the authors (70%), 
or the subject and sub-subject (50%)[KATZ], we have made the entry of these items 
mandatory in the semantic header. The abstract and annotations are useful in deciding 
whether the resource would be useful; these items are also included. Logically the 
entries in the semantic header are not positionally sensitive. However, for ease of use, 
we have arranged the fields in Figure 2 using the traditional library catalog layout. 


The first field of the semantic header is the title of the resource. It is a required field 
and is given within the tags beginning with <title> and terminated by </title>. The 
next field is an alt-title and is used to indicate a secondary title or an alternate title of 
the resource. This field is optional. The subject and the sub-subject of the resource are 
indicated in the next field which is a repeating group (a multi-part field with one or 
more occurrences of items in the group). All resources must have at least one 
occurrence for this field. 


The character set used and the language of the resource are given in the next two 
optional fields. The author of the resource is given in the next repeating group. The 
sub-fields are for name, organization, address, phone and fax numbers and e-mail 
address. All sub-fields except the name are optional, except where the author is an 
organization, in which case the organization must be given. The term author is used to 
include the role of programmer, creator, artist, etc. 


The list of keywords is included by a field marked by the tags <Keyword> ... 
</Keyword>. Each resource must have at least one keyword. If a published version of 
the resource is available, this is indicated by the next field which is followed by a 
place of publication and appropriate publication code (code name and number) such as 
ISBN followed by a number. 


The dates of creation (required), expiry and update, if any, are given next. The version 
number, if any, the intended coverage and the security or distribution classification are 
indicated in the next three fields. 


The location (URL[BERN1]) of the item is given in the next field, marked by 
the tags <URL> ... </URL>. It could include a list of one or more locations where the 
item may be available. The URN[RFC1737] field gives the unique name of the item, 
if any. This name may be used instead of a location (URL) if the item is likely to 
move or may be accessible from multiple locations[3]. 
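
The list of locations can be put to use as in the following Python sketch, which tries each recorded URL in turn until one can be retrieved. This is only an illustration under stated assumptions: the function name, the injected `fetch` callable and the example URLs are hypothetical, and the paper does not prescribe any particular retrieval order.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_available(
    urls: Iterable[str], fetch: Callable[[str], bytes]
) -> Optional[Tuple[str, bytes]]:
    """Try each location in order; return (url, content) for the first that works.

    `fetch` is any callable that returns the resource bytes or raises OSError
    when the location cannot be reached.
    """
    for url in urls:
        try:
            return url, fetch(url)
        except OSError:
            continue  # location moved or unreachable: try the next one
    return None


# A stand-in fetcher so the sketch runs without network access (hypothetical data).
def fake_fetch(url: str) -> bytes:
    available = {"http://mirror.example.org/paper.html": b"<html>paper</html>"}
    if url not in available:
        raise OSError("unreachable: " + url)
    return available[url]


locations = [
    "http://primary.example.org/paper.html",
    "http://mirror.example.org/paper.html",
]
print(first_available(locations, fake_fetch))
```

With a real network, `fetch` could be `lambda u: urllib.request.urlopen(u).read()`, whose errors derive from `OSError`.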


The semantic header contains an entry for an archive site. The field UAS (Universal 
archive site) is used to indicate the archive site for the resource. It is expected that the 
resource will exist at this site beyond the expiry date of the resource, if any. Of 
course, the site itself is guaranteed to exist beyond the life of any resource. It is 
envisaged that the archive site could be an independent resource provider. One 
example of such a traditional resource provider is the national library in most 
countries. One possibility is for the national libraries, such as the Library of Congress 
in the U.S., the British Library, and the National Library and CISTI in Canada, to archive Internet 
resources. However, private, for-profit corporations could be alternate sites for 
archiving resources. Archiving would provide an anchor for the otherwise ephemeral 
nature of some resources on the network. 


The abstract and annotations are given in the next fields. The abstract is provided by 
the author of the resource; the annotations are made by independent users of the 
resource. The annotation cannot be modified and includes the identity of the user 
along with a digital signature. 


A list of the hardware and software required is included in the semantic header as a 
repeating group. This is followed by the size of the resource and the cost of accessing 
it[4]. 


The last set of items in the semantic header is the control items, such as the account to 
which charges for accessing the resource are to be credited, and the encoded 
password or digital signature of the provider of the resource. Any change to the 
updatable part of the semantic header requires the password or digital signature. 
Another control piece of information is the digital signature of the resource itself. 
This may be used to authenticate the resource when it is retrieved through a semantic 
header. It is assumed that there is a mechanism to access the resource’s digital 
signature. 
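
The authentication step can be sketched in Python as follows. The paper does not say which signature scheme is intended, so this sketch makes a loud assumption: it uses a plain SHA-256 digest of the resource as a stand-in for the recorded signature. A real deployment would more likely use a public-key signature; the function names here are hypothetical.

```python
import hashlib
import hmac


def resource_digest(data: bytes) -> str:
    """Compute the value to be stored in the header (assumed here: SHA-256, hex-encoded)."""
    return hashlib.sha256(data).hexdigest()


def is_authentic(data: bytes, recorded_digest: str) -> bool:
    """Check retrieved bytes against the value recorded in the semantic header."""
    # compare_digest avoids leaking, through timing, where the values differ.
    return hmac.compare_digest(resource_digest(data), recorded_digest)


# Illustration with made-up content.
original = b"contents of the registered resource"
recorded = resource_digest(original)  # stored when the resource is registered

print(is_authentic(original, recorded))
print(is_authentic(b"contents altered in transit", recorded))
```

A digest alone only detects changes; it does not prove who produced the resource, which is why the provider's own signature is listed as a separate control item.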


<semhdr> 

<title> required </title> 

<alt-title> OPTIONAL </alt-title> 

<Subject> a list each of which includes fields for subject and up to two levels of 
sub-subject: at least one entry is required </Subject> 

<char-set> character set used: OPTIONAL </char-set> 

<language> of the information resource: OPTIONAL </language> 

<author> required 

a list each of which includes name, organization, address, etc. of each person/institute 
responsible for the information resource: at least the name or the organization and 
address is required </author> 

<Keyword> required: a list of keywords </Keyword> 

<Publisher> OPTIONAL in case of a published version </Publisher> 
<PublPlace> OPTIONAL in case of a published version </PublPlace> 
<Code>OPTIONAL in case of a published version </Code> 

<Dates> 

<Created> required: </Created> 

<Expiry> OPTIONAL </Expiry> 

<Updated> system generated </Updated> 

</Dates> 

<Version> OPTIONAL: version of the resource </Version> 

<Coverage> OPTIONAL: nature of the resource </Coverage> 

<Classification> OPTIONAL: security level of the resource </Classification> 
<URL> A list of locations (URL) Unique Universal Resource Locator/Call No for 
this resource: at least one required </URL> 

<URN> unique name of the resource (URN) </URN> 

<UAS> site where the item is to be archived </UAS> 

<Abstract> OPTIONAL but recommended </Abstract> 

<Annotation> OPTIONAL </Annotation> 

<SysReq> OPTIONAL: list of requirements in hardware and software 
<Hardware> OPTIONAL: list of hardware required </Hardware> 

<Software> OPTIONAL: list of software required </Software> 

</SysReq> 

<size> size of the resource in bytes </size> 

<Cost> OPTIONAL: cost of accessing the resource </Cost> 

<control> 

<Ac> account number </Ac> 

<password> required: encoded password or digital signature of provider of resource 
for initial entry and subsequent update </password> 


<signature> digital signature of the resource for authentication </signature> 
</control> 
</semhdr> 


Figure 2. Structure of the Semantic Header 
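
As a minimal sketch of how a registering system might check the fields marked "required" in Figure 2, the Python code below reads a header with the standard library's `html.parser` and reports the required fields that are absent. Several points are assumptions of the example rather than part of the proposal: tag names are compared in lowercase (the parser lowercases them), a field counts as entered if any text appears inside it or its sub-fields, and the sample header with its date value is made up.

```python
from html.parser import HTMLParser

# Fields marked "required" in Figure 2, in lowercase.
REQUIRED = ("title", "subject", "author", "keyword", "created", "url", "password")


class SemHdrParser(HTMLParser):
    """Record which elements of a semantic header contain some text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.fields = {}  # element name -> list of text fragments inside it

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy nesting.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Credit the text to every enclosing element, so that a group
            # such as <author> counts as entered when a sub-field is.
            for tag in self.stack:
                self.fields.setdefault(tag, []).append(text)


def missing_required(header_text):
    """Return the required field names that have no content in the header."""
    parser = SemHdrParser()
    parser.feed(header_text)
    parser.close()
    return [name for name in REQUIRED if name not in parser.fields]


# Hypothetical, deliberately incomplete header.
sample = """
<semhdr>
<title>Semantic Header</title>
<Subject><General>Computer Science</General></Subject>
<author><aname>DESAI, Bipin C.</aname></author>
<Keyword>indexing</Keyword>
<Dates><Created>1995/01/01</Created></Dates>
</semhdr>
"""
print(missing_required(sample))
```

The expert-system guidance described in the text would go further than this, for example by checking that subject terms come from a standardized vocabulary.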


<semhdr> 

<title>Semantic Header and Indexing and Searching on the Internet</title> 
<alt-title>Sailing the Internet with a navigational System</alt-title> 
<Subject> 

<ul> 

<li> 

<General>Computer Science </General> 
<Sublevel1>Information Storage and Retrieval</Sublevel1> 
<Sublevel2>indexing</Sublevel2> 

</li> 

<li> 

<General>Library Studies</General> 
<Sublevel1>cataloging</Sublevel1> 

<Sublevel2>semantic header</Sublevel2> 

</li> 

<li> 

<General>Computer Science </General> 

<Sublevel1> Artificial Intelligence</Sublevel1> 
<Sublevel2>expert systems</Sublevel2> 

</li> 

<li> 

<General>Computer Science </General> 
<Sublevel1>Database Management</Sublevel1> 
<Sublevel2>distributed databases</Sublevel2> 

</li> 

</ul> 

</Subject> 

<Language> English </Language> 

<Character> ISO-8879 </Character> 

<author> 

<ul> 

<li><aname>DESAI, Bipin C.</aname> 

<aorg>Concordia University, Department of Computer Science</aorg> 


<aAddress>7141 Sherbrooke Street West, Montreal, QC, CANADA, H4B 1R6</aAddress> 

<aphone>(514) 848 3025</aphone> 
<aFax>(514) 848 8652</aFax> 
<aemail>bcdesai@cs.concordia.ca</aemail> 
</li> 

</ul> 
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The amount of publicly available information resources on the Web is increasing rapidly. As this trend 
continues, finding these resources becomes more difficult. Several systems, notably Archie, Jumpstation, 
Lycos, WebCrawler, RBSE Index, and Harvest have attempted to solve this problem by building indices and 
allowing users to search them. 


These Web-wide indexing solutions do not, in practice, share information with each other. In addition, the 
databases often fail to provide users with specific or complete answers to their queries. 


This workshop has two goals: to find ways that index providers can share indexing and Web-structure 
information, and to explore ways to improve the query experience for users. Both goals stem from the need to 
address the increasing scale of the Internet: as the size of the problem increases, we need to be more efficient at 
building indices, and better at focusing on what the users are looking for. As we find solutions to these 
problems, it will enable us to build more efficient, consistent, and powerful indices of the World—Wide Web. 


In addition to a general discussion of Web-wide indexing, the workshop will have two specific tasks: 


1. to examine some of the existing protocols for sharing information to see what we can use and what we 
need to build, and 
2. to envision an operational plan for putting these tools to use on an experimental basis. 


Allowing indices to cooperate and exchange information can help solve several problems. First, it would allow 
the index builders to do a more efficient job of indexing. Indexing information for Europe might be collected 

in Europe, then transmitted in bulk to the United States. Or, we may find ways to build a distributed index, and 
avoid even the bulk transmission. Second, it would enable different retrieval engines to run against the same set 
of indexing information, providing better service for users and opportunities for research on different kinds of 
retrieval. Finally, it would server as an experimental tool for learning about decentralized indexing. 


This workshop is not meant to set any standards for indexing or exchange of indexing information. However, if 
it serves as a starting point for the experimentation that will give us the experience necessary to propose 
standards in the future, should that be desirable. 


The workshop environment is the ideal place to do this work; we will bring together people with experience 
building and running Web-wide indices. The ideal participant would be involved in building or operating an 
Internet information discovery system, or an expert in the field of database systems, distributed computing, 
expert systems, information retrieval, or library studies. 
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Summary 


What to index? 


Use the anchor term used for the HTML links, 
Use the title and headings of the HTML page, 
Use the full text to create an index, 

Use the filename of the HTML resource, 
Word occurrence — URL pairs. 

Inverted indices of keywords 

Indexes of interesting keywords 


How are index created? 


e Use a robot to scan the Web for new and changed HTML resource 
e Server side support based systems, 
e Different frequency of updates 


What is indexed? 


e Approximately twenty search engines with accompanying services for parts of the WWW, 
e Each covers a part of the Web, 
e Gateways to other indexing services such as WAIS 


What is needed? 


Establish a Web Indexers’ Working Group, 

Hierarchical searching, 

Share information and avoid replication, 

Parallel, fault-tolerant, and scalable index server, 
Language(Natural) independent 

Find all the having concept/property ----—— ; 
Index other form of resources: images, graphics or sound, 


e Capture structure of the Web, hyper—media, 
e Common interface, 
e Index generation with revision control 
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Abstract 


This paper describes an indexing system called semantic header for Internet resources. 


The semantic header contains the meta—information for each "publicly" accessible 
resource on the Internet. It also describes the registering system and the distributed 
database representing the union catalog of resources on the Internet. This database 
would be used in a search system to facilitate search. 


Introduction 


The trend in most research institutes, universities and business organization to 
interconnect their computing facilities using a digital network has become the 
accepted method of sharing resources. Such networks, in turn, are interconnected 
allowing information to be exchanged across networks using a common interchange 
protocol(viz TCP/IP). The number of such interconnected networks (Internet) 
continues to grow and with the emergence of powerful workstation—based servers 
connected to these networks, it is possible to support local as well as the remote 
search and retrieval of information stored on any component of the interconnection. 
At this time a number of information sources, both public(free) and private(available 
for a fee), are available on the Internet. They include text, computer programs, books, 
electronic journals, newspapers, organizational, local and national directories of 
various types, sound and voice recordings, images, video clips, scientific data, and 
private information services such as price lists and quotations, databases of products 
and services, and speciality newsletters. 


There is a need for the development of a system which allows easy ’search for and 
access to’ resources available on the Internet. It has been observed that distributed 
information systems, even though under control of a single administrative unit, create 
multiple problems typically caused by differences in semantics and representation, 
incomplete and incorrect data dictionaries (cataloging) [DESA4]. These problems 


would be magnified manyfold in any distributed information system which tries to 
integrate the resources offered by information systems over the Internet. It is 
important, also, to avoid problems encountered in a library system where, in spite of 
the fact that while the same cataloging system[2] is used, the same item may be 
differently catalogued/classified in two different libraries. 


Such problems could be avoided by starting with a standard index structure and 
building a bibliographic system using standardized control definitions. Such 
definitions could be built into the knowledgebase of an expert system based index 
entry and search interfaces. Furthermore, there must be a mechanism to revise index 
information as the resource changes over time. Finally, annotation of a resource by 
independent users should be allowed. 


The bibliographic entry system should be distributed and accessible to providers as 
well as users of the Internet. In a distributed system such as the Internet, it is natural
to have the providers of resources prepare and enter the bibliographic information
about each resource using the standardized index scheme. The entry system should be 
a distributed system and the index should be recorded in a distributed database. 
Finally, a search system to help in locating and retrieving appropriate information 
with ease from this database is required. 


Whereas the bibliographic entry and search systems (clients) could be located locally
at the providers and users of information resources respectively, the bibliographic
database system (server) should be distributed and replicated at a number of regional
nodes for enhanced availability and response. The entry and search systems have to be
supported by an easy-to-use graphical interface for entering the index information
and accessing it. These systems should incorporate the expertise and knowledge of
expert cataloguers and reference librarians, with a help system to guide the user at all
steps. The search system should, in addition, provide appropriate feedback indicating
the number of hits for each search, and help in providing access to the relevant
resources. The navigation of database and resource nodes and the protocols and filters
used would be selected by the system, thus facilitating the task of the user. The
purpose is to provide uniform access to all resources, as is done in a centralized
information system through the intermediary of an expert system analyst. The overall
structure of such a system is given in Figure 1.


Source of Information and Meta-Information


Information sources can be classified into three categories[KATZ]: primary,
secondary and tertiary. Primary information is the original material in the form of
published or posted articles, monographs, reports, dissertations, programs, images,
movies, etc. Other primary sources such as personal communications are not usually
available. Secondary sources, sometimes called meta-information, are used as indices
to these primary sources of information and are created after a delay which may be a
few months to a few years. The meta-information is data about the primary source. A
tertiary source of information is a combination of selected and distilled information
from primary and secondary sources.


The purpose of indices and bibliographies (secondary information) is to inventory the 
primary information and allow easy access to it. Preparing a bibliography requires 
finding the primary source, identifying it as to its subject, etc., describing it for later 
matching for unknown future users and classifying it according to accepted norms. 


Since an index is to be used by many users, it has to be accurate, easy to use (usage via
author, title, subject, etc.), properly classified, up to date and complete for its area of
coverage. In order for a bibliography to be useful, it must fill a real need. The success
of Archie as a bibliography system (for files available on the Internet via FTP) is that
it provides a simple interface to users who are aware of the name of a program or file,
or of the general nature of a file likely to be distributed from one or more anonymous
FTP sites. In the case of an on-line bibliography of Internet resources such as the Web,
the need is for the system to be current within a short period (minutes or at most
hours) of the posting of a new resource. Compare this with the bibliography system
for printed publications, which requires weeks or months in the case of the on-line
databases, longer for the CD version and up to years for the printed version. Even the
on-line database needs a considerable amount of time before documents are indexed
in the bibliography.


The method of compiling a traditional bibliography varies. At one extreme, we have 
scholars spending years of their lives evaluating sources and compiling annotated and 
descriptive entries for each item. The accuracy of this bibliography is high but the 
coverage tends to be limited. At the other extreme, we have the semi-automatic
mechanism which scans the published works from limited sources (by domain,
language, or geographic region) and assigns each work to appropriate sub-subject(s).
Access from multiple headings may be provided. This is desirable because an item
may deal with more than one topic. Although the bibliography prepared by the former
method could be more accurate, it tends to be retrospective rather than
current.


The dependence on titles as a search criterion dictates that they must be indicative of 
the contents of the document. This is not always the case; hence someone (the author or
the cataloger) has to add annotation, keywords or key phrases to indicate the actual
content. Accuracy or quality of a document can be indicated by including reviewers'
opinions. However, such opinions are rarely accessible to the cataloger. Another
feature of importance to the user of an index is the presence of an accurate abstract.
An abstract provides a summary of the material and thus is more indicative of the 
contents than the title or keywords supplied by the author, bibliographer or selected 
from scanning the text. Reference librarians and library users tend to use such 
annotated bibliographies to help choose among competing sources. 


Features such as division of the bibliography by subject and sub-subjects, though of
concern in the manual systems, should not be apparent in the electronic form.
However, access through these criteria must be supported. Weeding out bibliography
entries for Internet resources that are no longer accessible, though attractive, may
require careful thought from the point of view of completeness. The archiving of
resources in central libraries could mean that such weeding of the bibliography would
not be necessary.


A Cataloging and Searching System 


Library catalogs are prepared by specialists; each entry records the author,
title, publisher, place of publication, date of publication and other details. The term
union list, in the library lexicon, refers to the catalog which is the union of the
catalogs of a number of participating libraries. It indicates which item is located
where. In this sense, the bibliography forms a union list of all sources of documents.
Since the item in question is not in electronic form, it requires the intermediary of the
inter-library loan mechanism to borrow it (usually from the nearest location which
permits the title to be borrowed or, if possible, to photocopy sections of it).


Currently a large number of documents exist in addition to the files whose names 
could be searched via systems such as Archie or Xarchie. The popularity of the World 
Wide Web[BERN, BERN3] and browsers such as Mosaic[MOSA] has prompted 
many researchers to start publishing on-line. Attempts to provide easy searching of
relevant documents have led to a number of systems including WAIS and, more
recently, a number of Spiders, Worms and other creepy crawlers[DEBR, FLET,
KOST, MCBR, META, THAU, SEAR, WEBC, WWWW].


However, the problem with many of these indices is that their selectivity of
documents is often poor. The chance of retrieving inappropriate documents and missing
relevant information because of a poor choice of search terms is large. In addition, the
user is required to access the actual resource, based on just the title and author
information as provided through a library catalog, and then decide whether the resource
meets the needs.


These problems are addressed in our proposed system by using an appropriate index 
entry called Semantic Header[BCD2] and providing a mechanism to register, manage 
and search the bibliography. The system is an active system requiring the provider of 
information to register the resource by entering an index entry for the resource. Since 
the provider is responsible for preparing the index entry, there is the potential for its 
accuracy to be high. 


The overall system uses knowledge bases and expert sub-systems to help the user in
the registration and search process. One need for an expert system is in avoiding the
chaos introduced by differences in the perceptions of different indexers. Hence, some
form of standardization of the terms used has to be enforced. We envisage this through
the intermediary of an expert system based engine. The index generation and maintenance
sub-system uses the knowledge and expertise of the expert cataloguer to help the
provider of the resource select correct terms for items such as subject, sub-subject and
keywords. Similarly, another expert system is used in the search sub-system to help
the user in the search for appropriate information resources. The third component of
the system is a distributed and replicated database of the bibliography of resources
available on-line. The database is in the background and the users are not aware of its
presence, much less of its distributed and replicated nature. These components are
described below.


Semantic Header 


The heart of any bibliography or indexing system is the record that is kept for each 
item that is being indexed. Standardization of a bibliographic entry allows libraries to 
exchange information about their collections. A number of projects in the Library 
domain have addressed the problem of cataloging and in particular cataloging of 
information in electronic and multi-media format. CORE[CROM], MARC 
system[BRYN, CRAW, MARC, PETE], MLC[HORN, ROSS, RHEE] and 
TEI[GAYN, GIOR] are examples of some of these initiatives. These existing and 
proposed indexing systems range from a minimum to full level of bibliographic 
information. However, such systems are designed for professional catalogers and 
many of the items included in them, though useful, are beyond the comprehension of 
most providers or users of information. 


We have proposed a simple index structure called Semantic Header [DESA2] for 
resources accessible directly on the Internet. The structure of the index is similar to
the ones used for most library indices and includes other information deemed useful
for on-line systems. The syntax of the semantic header is that of the HTML markup
language[BERN2], which is based on the SGML markup language. However, the user
working with the index entry system is guided through the process by an expert 
system. This system guides the user in the choice of standardized terms through an 
easy to use graphical interface. 


We give, in Figure 2 below, the structure of the Semantic Header. An example of use 
of the semantic header is given in Figure 3. The intent of the semantic header is to 
include those items that are most often used in the search of an information resource. 
Since the majority of searches begin with a title or the name of one of the authors
(70%), or with subject and sub-subject (50%)[KATZ], we have made the entry of these
items mandatory in the semantic header. The abstract and annotations are useful in deciding
whether the resource would be useful; these items are also included. Logically the 
entries in the semantic header are not positionally sensitive. However, for ease of use, 
we have arranged the fields in Figure 2 using the traditional library catalog layout. 


The first field of the semantic header is the title of the resource. It is a required field
and is given between the tags <title> and </title>. The next field is alt-title, which is
used to indicate a secondary or alternate title of the resource. This field is optional.
The subject and the sub-subject of the resource are indicated in the next field, which is
a repeating group (a multi-part field with one or more occurrences of items in the
group). All resources must have at least one occurrence of this field.


The character set used and the language of the resource are given in the next two
optional fields. The author of the resource is given in the next repeating group. The
sub-fields are for name, organization, address, phone and fax numbers and e-mail
address. All sub-fields except the name are optional, except where the author is an
organization, in which case the organization must be given. The term author is used to
include the role of programmer, creator, artist, etc.


The list of keywords is included by a field marked by the tags <Keyword> ... 
</Keyword>. Each resource must have at least one keyword. If a published version of 
the resource is available, this is indicated by the next field which is followed by a 
place of publication and appropriate publication code (code name and number) such as 
ISBN followed by a number. 


The dates of creation (required), expiry and update, if any, are given next. The version
number, if any, the intended coverage and the security or distribution classification are
indicated in the next three fields.


The location (URL[BERN1]) of the item is indicated by the next field indicated by 
the tags <URL> ... </URL>. It could include a list of one or more locations where the 
item may be available. The URN[RFC1737] field gives the unique name of the item, 
if any. This name may be used instead of a location (URL) if the item is likely to 
move or may be accessible from multiple locations[3]. 


The semantic header contains an entry for an archive site. The field UAS (Universal 
archive site) is used to indicate the archive site for the resource. It is expected that the 
resource will exist at this site beyond the expiry date of the resource, if any. Of 
course, the site itself is guaranteed to exist beyond the life of any resource. It is 
envisaged that the archive site could be an independent resource provider. One 
example of such a traditional resource provider is the national library in most 
countries. One possibility is for national libraries such as the Library of Congress
in the U.S., the British Library, and the National Library and CISTI in Canada to
archive Internet resources. However, private, for-profit corporations could be alternate sites for
archiving resources. Archiving would provide an anchor for the otherwise ephemeral 
nature of some resources on the network. 


The abstract and annotations are given in the next fields. The abstract is provided by 
the author of the resource; the annotations are made by independent users of the 
resource. The annotation cannot be modified and includes the identity of the user 
along with a digital signature. 


A list of the hardware and software required is included in the semantic header as a
repeating group. This is followed by the size of the resource and the cost of accessing
it[4].


The last set of items in the semantic header is the control items, such as the account to
which credits are to be made for charges for accessing the resource, and the encoded
password or the digital signature of the provider of the resource. Any change to the
updatable part of the semantic header requires the password or digital signature.
Another control item is the digital signature of the resource itself.
This may be used to authenticate the resource when it is retrieved through a semantic
header. It is assumed that there is a mechanism to access the resource's digital
signature.


<semhdr> 

<title> required </title> 

<alt-title> OPTIONAL </alt-title> 

<Subject> a list each of which includes fields for subject and up to two levels of 
sub-subject: at least one entry is required </Subject> 

<char-set> character set used: OPTIONAL </char-set> 

<language> of the information resource: OPTIONAL </language> 

<author> required 

a list each of which includes name, organization, address, etc. of each person/institute 
responsible for the information resource: at least the name or the organization and 
address is required </author> 

<Keyword> required: a list of keywords </Keyword> 

<Publisher> OPTIONAL in case of a published version </Publisher> 
<PublPlace> OPTIONAL in case of a published version </PublPlace> 
<Code>OPTIONAL in case of a published version </Code> 

<Dates> 

<Created> required: </Created> 

<Expiry> OPTIONAL </Expiry> 

<Updated> system generated </Updated> 

</Dates> 

<Version> OPTIONAL: version of the resource </Version> 

<Coverage> OPTIONAL: nature of the resource </Coverage> 

<Classification> OPTIONAL: security level of the resource </Classification> 
<URL> A list of locations (URL) Unique Universal Resource Locator/Call No for 
this resource: at least one required </URL> 

<URN> unique name of the resource (URN) </URN> 

<UAS> site where the item is to be archived </UAS> 

<Abstract> OPTIONAL but recommended </Abstract> 

<Annotation> OPTIONAL </Annotation> 

<SysReq> OPTIONAL: list of requirements in hardware and software 
<Hardware> OPTIONAL: list of hardware required </Hardware> 

<Software> OPTIONAL: list of software required </Software> 

</SysReq> 

<size> size of the resource in bytes </size> 

<Cost> OPTIONAL: cost of accessing the resource </Cost> 

<control> 

<Ac> account number </Ac> 

<password> required: encoded password or digital signature of provider of resource 
for initial entry and subsequent update </password> 


<signature> digital signature of the resource for authentication </signature> 
</control> 
</semhdr> 


Figure 2. Structure of the Semantic Header 
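

The required and optional markings of Figure 2 lend themselves to a mechanical check
before an entry is registered. The following is a minimal sketch in Python, not part of
the proposed system: the class names, attribute names and the function missing_required
are illustrative assumptions, and only the items marked "required" in Figure 2 are
modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Author:
    """One entry of the <author> repeating group (attribute names are illustrative)."""
    name: Optional[str] = None
    organization: Optional[str] = None
    address: Optional[str] = None


@dataclass
class SemanticHeader:
    """Hypothetical in-memory form holding only the required items of Figure 2."""
    title: str = ""
    subjects: List[str] = field(default_factory=list)
    authors: List[Author] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    created: str = ""
    urls: List[str] = field(default_factory=list)
    password: str = ""


def author_ok(a: Author) -> bool:
    # Figure 2: at least the name, or the organization and address, is required.
    return bool(a.name) or bool(a.organization and a.address)


def missing_required(h: SemanticHeader) -> List[str]:
    """Return the names of the required items that are absent from the header."""
    missing = []
    if not h.title.strip():
        missing.append("title")
    if not h.subjects:
        missing.append("Subject")
    if not h.authors or not all(author_ok(a) for a in h.authors):
        missing.append("author")
    if not h.keywords:
        missing.append("Keyword")
    if not h.created.strip():
        missing.append("Created")
    if not h.urls:
        missing.append("URL")
    if not h.password:
        missing.append("password")
    return missing
```

An entry would be accepted for registration only when missing_required returns an empty
list; the optional items of Figure 2 could be added to the class without changing the check.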


<semhdr> 

<title>Semantic Header and Indexing and Searching on the Internet</title> 
<alt-title>Sailing the Internet with a navigational System</alt-title> 
<Subject> 

<ul> 

<li> 

<General>Computer Science </General> 
<Sublevel1>Information Storage and Retrieval</Sublevel1> 
<Sublevel2>indexing</Sublevel2> 

</li> 

<li> 

<General>Library Studies</General> 
<Sublevel1>cataloging</Sublevel1> 

<Sublevel2>semantic header</Sublevel2> 

</li> 

<li> 

<General>Computer Science </General> 

<Sublevel1> Artificial Intelligence</Sublevel1> 
<Sublevel2>expert systems</Sublevel2> 

</li> 

<li> 

<General>Computer Science </General> 
<Sublevel1>Database Management</Sublevel1> 
<Sublevel2>distributed databases</Sublevel2> 

</li> 

</ul> 

</Subject> 

<Language> English </Language> 

<Character> ISO-8879 </Character> 

<author> 

<ul> 

<li><aname>DESAI, Bipin C.</aname> 

<aorg>Concordia University, Department of Computer Science</aorg> 


<aAddress>7141 Sherbrooke Street West, Montreal, QC, CANADA, H4B 1R6</aAddress> 

<aphone>(514) 848 3025</aphone> 
<aFax>(514) 848 8652</aFax> 
<aemail>bcdesai@cs.concordia.ca</aemail> 
</li> 

</ul> 


</author> 

<Keyword> 

<ul> 

<li> Bibliographic record</li> 

<li>Content description</li> 

<li>Database Systems</li> 

<li>Expert Systems</li> 

<li> Indexing </li> 

<li>Searching</li> 

<li> URC </li> 

</ul> 

</Keyword> 

<Dates> 

<Created> 1994-07-11</Created> 

<Expiry>1995-08-07</Expiry> 

<Updated> 1995-02-07</Updated> 

</Dates> 

<Version> 1.0 </Version> 

<Coverage> Universal </Coverage> 

<Classification>Public </Classification> 

<URL> http://www.cs.concordia.ca/~faculty/bcdesai/cindi-system-1.0.html</URL> 
<URN><comment> Unique Universal Resource Name for this resource. No such 
service exists to date. In the absence of one, we use the concatenation of Title, first 
author, first subject, creation date and version number. Do we really need another level 
of complexity especially if we have a good index and catalogue system? Is the current 
system of using domain name followed by other names not good enough? It is the 
most distributed version possible. Here domain names not only signify Internet 
domain but other domains such as ISBN, UPC, etc. </comment> 

Semantic Header and Indexing and Searching on the Internet|Computer 
Science|Information Storage and Retrieval|indexing|DESAI, Bipin C.|1994-07-11|1.0 
</URN> 

<UAS><comment> Universal Archive Site where this document is 
archived</comment> ftp://ftp.cs.concordia.ca/bced/cindi-system-1.0.html</UAS> 
<abstract>This paper describes an indexing system called semantic header for Internet 
resources. The semantic header contains the meta-information for each "publicly" 
accessible resource on the Internet. It also describes the registering system and the 
distributed database representing the union catalog of resources on the Internet. This 
database would be used in a search system to facilitate search.</abstract> 
<Annotation></Annotation> 

<size> 44000 </size> 

<Cost><comment>Cost, Currency</comment> 0.27, Can$</Cost> 

<control> 

<Ac> BCD’s Swiss number a/c </Ac> 

<password> thequickbrownfoxjumpsoverthelazydog </password> 

<signature> 01001010101110101101010110011101 </signature> 

</control> 


</semhdr> 


Figure 3. An example of a Semantic Header Entry 


Index Registering Sub-system 


The index entry and registering sub-system provides a graphical interface (Figure 4)
that helps the provider (author/creator) of a resource register the bibliographic
information about the resource. The interface allows the provider to enter the
information and provides help by means of pop-up selection windows and an expert
engine (not shown in Figure 4) that suggests controlled terms. Once the information is
correctly entered, the author can decide to register the Semantic Header entry in the
Semantic Header database. When the header information is accepted by the database,
the author/creator is notified. A password or a digital signature is to be provided
when the semantic header is first registered and for all changes made to it. Since the
encoded password or digital signature is not accessible by anyone other than the
original registrar of the index entry, the entry can only be updated by person(s) who
are cognizant of it. Changes may be needed because of changes made to the
resource or its migration from one system to another. A copy of the semantic header
is stored at the site of the resource. It is desirable that the semantic header be attached
to the actual resource. However, this cannot be done until all hardware and/or
software systems can handle such a header (viz. ignore it).
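

The rule that an entry can be changed only by someone who knows the password given at
registration can be sketched as follows. This is an illustration under stated
assumptions, not the encoding used by the proposed system: "encoded password" is
taken here to mean a salted PBKDF2 hash, and the class name RegisteredEntry and its
methods are hypothetical.

```python
import hashlib
import hmac
import os


def encode_password(password: str, salt: bytes) -> bytes:
    """Derive an encoded form of the password (assumed scheme: PBKDF2-HMAC-SHA256)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


class RegisteredEntry:
    """A registered index entry that keeps only the encoded password."""

    def __init__(self, fields: dict, password: str) -> None:
        self._salt = os.urandom(16)
        self._encoded = encode_password(password, self._salt)
        self.fields = dict(fields)

    def update(self, changes: dict, password: str) -> bool:
        """Apply the changes only if the supplied password matches the stored one."""
        candidate = encode_password(password, self._salt)
        if not hmac.compare_digest(candidate, self._encoded):
            return False  # entry left untouched
        self.fields.update(changes)
        return True
```

A migration of the resource from one system to another would then be recorded by calling
update with the new location and the original password.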


The system verifies the accessibility of the resource being added. Also the digital 
signature of the resource is retrieved and added to the semantic header. The purpose of 
this last piece of information is to establish the veracity of the resource when it is 
retrieved through a semantic header. If the resource is corrupted, this veracity 
validation would fail and the user would be notified; no charges, if there are any, 
would be made. 
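

The veracity check described above can be illustrated with a short sketch. It assumes,
purely for illustration, that the "digital signature of the resource" is a SHA-256
digest of its contents; the function names and the (ok, charge) return convention are
hypothetical.

```python
import hashlib
import hmac
from typing import Tuple


def resource_digest(content: bytes) -> str:
    """Digest recorded in the semantic header at registration time (assumed: SHA-256)."""
    return hashlib.sha256(content).hexdigest()


def verify_resource(content: bytes, recorded: str) -> bool:
    """True when the retrieved content matches the digest stored in the header."""
    return hmac.compare_digest(resource_digest(content), recorded)


def charge_for_retrieval(content: bytes, recorded: str, cost: float) -> Tuple[bool, float]:
    """Return (ok, amount charged); a corrupted resource is reported and not charged."""
    if not verify_resource(content, recorded):
        return (False, 0.0)
    return (True, cost)
```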


If each resource is given a unique name (URN), the semantic header database can be 
used for mapping from URN to URL. Since only one semantic header could be 
associated with a given URN, a search with a given URN will retrieve at most one 
semantic header. One of the URLs in it can be used to access the resource in question. 
This form of search can be implemented at a low level without the need for a 
graphical interface. 
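

A low-level URN to URL lookup of this kind can be sketched as a simple table. The
sketch assumes the interim naming convention used in Figure 3 (items joined with "|" in
the order title, first subject, first author, creation date, version); the names
fallback_urn and UrnIndex are illustrative and not part of the proposal.

```python
from typing import Dict, List, Sequence


def fallback_urn(title: str, first_subject: Sequence[str], first_author: str,
                 created: str, version: str) -> str:
    """Join the identifying items with '|' in the order used in Figure 3."""
    return "|".join([title, *first_subject, first_author, created, version])


class UrnIndex:
    """Maps each URN to the list of URLs held in its (single) semantic header."""

    def __init__(self) -> None:
        self._urls: Dict[str, List[str]] = {}

    def register(self, urn: str, urls: List[str]) -> None:
        # Only one semantic header may be associated with a given URN.
        if urn in self._urls:
            raise ValueError("URN already registered: " + urn)
        self._urls[urn] = list(urls)

    def resolve(self, urn: str) -> List[str]:
        """Return the URLs for the URN, or an empty list if it is unknown."""
        return list(self._urls.get(urn, []))
```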


The index entry that is registered is communicated to a database described below. 


The Semantic Header Distributed Database System 


The index entries registered by providers of resources are stored in a distributed
database system (SHDDB). From the point of view of the users of the system, the
underlying Semantic Header database may be considered to be a monolithic system. In
reality, it would be distributed and replicated, allowing for reliable and
failure-tolerant operations. The interface hides the distributed and replicated nature
of the database. The distribution is based on subject areas and as such the database is
considered to be horizontally partitioned [DESA5].


It is envisaged that the database on different subjects will be maintained at different 
nodes of the Internet. The locations of such nodes need only be known by the intrinsic 
interface. A database catalog would be used to distribute this information. However, 
this catalog itself could be distributed and replicated as is done for distributed 
database systems. 


The Semantic Header information entered by the provider of the resource using a 
graphical interface is relayed from the user’s workstation by a client process to the 
database server process at one of the nodes of the SHDDB. The node is chosen based 
on its proximity to the workstation or on the subject of the index record. On receipt of 
the information, the server verifies the correctness and authenticity of the information 
and on finding everything in order, sends an acknowledgment to the client. 


The server node is responsible for locating the partitions of the SHDDB where the
entry should be stored and for forwarding the replicated information to the appropriate
nodes. For example, the semantic header entry of Figure 3 would be part of the SHDDB for
the subjects Computer Science and Library Studies.
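

The subject-based placement can be illustrated with a small sketch in which each
top-level subject stands for one partition of the SHDDB. The class
SubjectPartitionedStore and the dictionary layout of a header are assumptions made for
the example; a real server would forward the entry to remote nodes instead of appending
to local lists.

```python
from collections import defaultdict
from typing import Dict, List


class SubjectPartitionedStore:
    """Toy stand-in for the SHDDB: one list of headers per top-level subject."""

    def __init__(self) -> None:
        self.partitions: Dict[str, List[dict]] = defaultdict(list)

    def store(self, header: dict) -> List[str]:
        """Place the header in every partition named by its subjects.

        Returns the sorted list of partitions that received a copy.
        """
        targets = sorted(set(header["subjects"]))
        for subject in targets:
            self.partitions[subject].append(header)
        return targets
```

With the four subject entries of Figure 3 (three under Computer Science and one under
Library Studies), the header is stored once in each of the two partitions.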


Similarly the database server process is responsible for providing the catalogue 
information for the search system. In this way the various sites of the database work 
in a cooperating mode to maintain consistency of the replicated portion. The 
replicated nature of the database also ensures distribution of load and ensures 
continued access to the bibliography when one or more sites are temporarily 
nonfunctional. As the SHDDB grows, search performance
could be improved by using techniques used in databases[DESA6].


The Search System 


The design of the search system is guided by the model of a human
reference librarian. S/he is called on to help in identifying the best sources of
information for a given purpose and to aid in the selection of materials to meet a
particular interest or need. The reference librarian seeks the responses to these queries
by using information derived from bibliographic searches processed through the
librarian's own expertise and knowledge of the relevant subject. In addition, users of a
library have access to the same bibliographic indices and many of the information
databases from which they are called on to select relevant titles or weed out irrelevant
ones.


A typical query to a reference librarian falls into one of two categories: known and
unknown[KATZ]. In the former, a user asks for an item identified by author, title, or
publication source. In the latter, the need of the user is fuzzy; s/he has no idea of any
of the identifiers of the needed item. Even in the case of the known queries, there is
the possibility that the user may have the wrong author, the right author but the wrong
title, wrong dates or an incorrect volume or issue number for a serial. It may
also happen that even when these are correct, the item is not the one that meets the
need of the user.


A specific search and research type query may require the user to peruse a number of 
titles and select from among them. This type of query involves users who have fuzzy 
notions of their needs and their questions are vague. They involve a certain amount of 
trial and error retrieval of documents and their browsing. 


One problem that human librarians deal with is that of the inability of the users to ask 
the relevant questions. The reference librarian, through a dialog with the user tries to 
narrow down the user’s needs in terms of what and how much information is 
required. In many cases the librarian is called upon to match the user needs with the 
sources of information. For example, an article from the popular press may be 
appropriate for a lay person as opposed to one appearing in a prestigious journal 
dedicated to the subject. 


In the search component of the proposed system we plan to incorporate the expertise 
used by a reference librarian. This expertise will guide the user in entering the various 
search items in a graphical interface similar to the one used by the index entry system 
(Figure 5). The expert search sub-system requires the expertise of a reference 
librarian to be built into it to help users formulate queries and launch these queries. 
As in the case of the index generation sub-system, the expert system provides help in 
choosing appropriate search terms for index entries such as subject, sub—subject, 
keywords etc. To optimize the search, the expert system uses the following type
of statistics for a typical use of a bibliography[KATZ]:


- 70% of queries are by title or by author's name;


- 50% of queries start with subject, and these tend to be complex, requiring subdivision
and refinement.


The search system also uses a graphical interface and a client process. Once the user 
has entered a search request, the client process communicates with the nearest 
SHDDB catalogue to determine the appropriate site of the SHDDB database. 
Subsequently, the client process communicates with this database and retrieves one or 
more semantic headers. The result of the query could then be collected and sent to the 
user’s workstation. The contents of these headers are displayed, on demand, to the user 
who may decide to access one or more of the actual resources using a graphical 
window as in Figure 6. It may happen that the item in question may be available from 
a number of sources. In such a case the best source is chosen based on optimum costs. 
The client process would attempt to use appropriate hardware/software to retrieve the 
selected resources. 
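

The two lookups in this flow, finding the SHDDB site for a subject and then choosing
among several sources of the same item, can be sketched as below. The catalogue is
reduced to a dictionary from subject to site name, and "optimum costs" is read, as an
assumption, to mean the lowest access cost; Source, site_for_subject and
cheapest_source are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Source(NamedTuple):
    """One location from which the item is available, with its access cost."""
    url: str
    cost: float


def site_for_subject(catalogue: Dict[str, str], subject: str) -> str:
    """Look up the SHDDB site that holds the partition for the given subject."""
    return catalogue[subject]


def cheapest_source(sources: List[Source]) -> Source:
    """Pick the source with the lowest cost (the first one wins on a tie)."""
    return min(sources, key=lambda s: s.cost)
```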


Annotations and Reviewing 


The scientific world depends on peer review of documents submitted for publication. 
Such annotations used for reviews tend not to be published. However, comments to the
editor made by readers of the serials are usually published and are accessible to the
community. Since many of the resources on the Internet tend to be non-reviewed, it
would be useful for a user to have access to annotations made by other users for a 
given resource. The proposed system allows users to add annotations to an existing 
resource. These annotations are stored along with the index in the SHDDB. 


The annotation sub-system is similar to the indexing subsystem. However, only a few 
of the indexing entries, to uniquely identify the resource in question, are required 
(Figure 7). An annotation made by any user can be entered and would be registered 
with the identity and digital signature of the user. Each annotation could then be
incorporated in the index entry (at least logically) and could be retrieved with the
index. Such annotations by recognized persons would be a valuable guide for future
users. 
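

The requirement that an annotation carries the identity and digital signature of its
author and cannot be modified afterwards can be illustrated with an immutable record.
This is only a sketch: the field names are assumptions, and the signature is held as an
opaque string without being verified.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Annotation:
    """An annotation that cannot be altered once it has been created."""
    user: str
    signature: str  # digital signature of the annotating user, treated as opaque
    text: str


def add_annotation(existing: Tuple[Annotation, ...],
                   new: Annotation) -> Tuple[Annotation, ...]:
    """Return a new tuple with the annotation appended; earlier ones are never changed."""
    return existing + (new,)
```

Keeping the annotations in a tuple alongside the index entry matches the idea that they
are incorporated in the entry at least logically and retrieved together with it.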


The peer reviews of electronically submitted papers could be implemented using such 
annotations. Authentication of reviews has to be done by an appropriate editorial 
board. 


Conclusions: Advantages of the approach 


Current index systems are based on harvesting the network for new documents;
such documents are retrieved and their contents used to provide terms for the index.
The big disadvantage of this scheme is the unreliability of the index entries
produced and the lack of an authentic abstract for the item. Currently, such schemes
are relevant for Web text documents and are not applicable to other resources.
Another problem with this approach is the unnecessary traffic on the network and the
lack of cooperation and sharing among different systems. Finally, this approach will
become infeasible as more and more providers of information require payment:
creating an index would itself require payment. Furthermore, users, without having a
better idea of their contents, would not be inclined to retrieve resources which, from
their titles, seem irrelevant.


In the proposed system, the provider of the resource is the one who prepares the index
information. Consequently, such an index entry would be more reliable than one
derived by a third party or by simply scanning a document. The abstract field gives
the provider of the resource the opportunity to supply a pertinent abstract or summary.
Such a summary in the index allows users to make better-informed decisions
regarding the relevance of the source resource.


The system provides an expert-system-driven graphical interface for the provider of
the resource to produce an index entry and have this entry entered in the index
database. The expert system provides help in choosing appropriate terms for index
entries such as subject, sub-subject, keywords, etc. It is also responsible for verifying
the consistency of the index entry and the accessibility of the resource, and then posting
the index entry to the index database.


In addition, the index database contains a number of control entries for the resource. 
Control entries are items such as size of the resource, the password for authenticating 
subsequent updates of the index entry, and a list of annotations made about the 
resource by independent users.
[1] This paper describes the CINDI subsystem, a part of the CUILT Project for
Developing a Virtual Library Prototype. 


[2] Libraries use a number of basic catalogue systems such as Library of Congress,
Dewey Decimal and MARC. Even among MARC there are slight differences as in 
LCMARC and CANMARC. 


[3] The idea of the semantic header is to provide bibliographic information about
resources; by including both the URN and a list of URLs, it also provides a mapping
from URN to URL.


[4] Such costs could change over time and require updating. 
Figure 3 An example of a Semantic Header Entry


Index Registering Sub-system


The index entry and registering sub-system provides a graphical interface (Figure 4)
that helps the provider (author/creator) of a resource register the bibliographic
information about the resource. The interface allows the provider to enter the
information and provides help by means of pop-up selection windows and an expert
engine (not shown in Figure 4) that suggests controlled terms. Once the information is
correctly entered, the author can decide to register the Semantic Header entry in the
Semantic Header database. When the header information is accepted by the database,
the author/creator is notified. A password or a digital signature must be provided
when the semantic header is first registered and for all changes made to it. Since the
encoded password or digital signature is not accessible to anyone other than the
original registrant of the index entry, the entry can only be updated by person(s) who
are cognizant of it. Changes may be needed because of changes made in the
resource or its migration from one system to another. A copy of the semantic header
is stored at the site of the resource. It is desirable that the semantic header be attached
to the actual resource. However, this cannot be done until all hardware and/or
software systems can handle such a header (viz. ignore it).
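

One way the password-protected update could work is sketched below. It is a minimal
illustration, not the system's actual mechanism: the HeaderStore class and its methods
are hypothetical, and the choice of a salted PBKDF2 hash as the "encoded password" is
an assumption made for the example.

```python
import hashlib
import hmac
import os


def encode_password(password: str, salt: bytes) -> bytes:
    """Derive the encoded form of the password (assumed scheme: PBKDF2-SHA256)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


class HeaderStore:
    """Hypothetical stand-in for the Semantic Header database."""

    def __init__(self):
        self._entries = {}

    def register(self, key, header, password):
        # Only the encoded password is kept, never the password itself.
        salt = os.urandom(16)
        self._entries[key] = {
            "header": dict(header),
            "salt": salt,
            "secret": encode_password(password, salt),
        }

    def update(self, key, changes, password) -> bool:
        """Apply changes only if the supplied password matches the stored one."""
        record = self._entries[key]
        supplied = encode_password(password, record["salt"])
        if not hmac.compare_digest(supplied, record["secret"]):
            return False
        record["header"].update(changes)
        return True

    def get(self, key):
        return dict(self._entries[key]["header"])


store = HeaderStore()
store.register("h1", {"url": "http://old.example/doc"}, "s3cret")
rejected = store.update("h1", {"url": "http://new.example/doc"}, "wrong")
accepted = store.update("h1", {"url": "http://new.example/doc"}, "s3cret")
```

The first update is refused and leaves the header untouched; the second, made with the
registration password, records the resource's new location.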


The system verifies the accessibility of the resource being added. The digital
signature of the resource is also retrieved and added to the semantic header. The purpose of
this last piece of information is to establish the veracity of the resource when it is
retrieved through a semantic header. If the resource is corrupted, this veracity
validation would fail and the user would be notified; no charges, if any apply,
would be made.
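

The veracity check can be illustrated with a short sketch. The paper does not say how
the digital signature of a resource is computed; the example assumes, purely for
illustration, that a SHA-256 digest of the resource's bytes plays that role, and the
function and field names are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Hex digest standing in for the resource's 'digital signature' (assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_resource(data: bytes, recorded_fingerprint: str) -> bool:
    """True if the retrieved bytes match the fingerprint stored in the header."""
    return fingerprint(data) == recorded_fingerprint


# At registration time the system records the fingerprint in the semantic header.
original = b"contents of the registered resource"
header = {"title": "Example resource", "fingerprint": fingerprint(original)}

# At retrieval time the client recomputes the fingerprint and compares.
intact = verify_resource(original, header["fingerprint"])
corrupted = verify_resource(original + b"!", header["fingerprint"])
```

A client following this scheme would notify the user, and levy no charge, whenever the
comparison fails.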


If each resource is given a unique name (URN), the semantic header database can be 
used for mapping from URN to URL. Since only one semantic header could be 
associated with a given URN, a search with a given URN will retrieve at most one 
semantic header. One of the URLs in it can be used to access the resource in question. 
This form of search can be implemented at a low level without the need for a 
graphical interface. 
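

A low-level URN lookup of this kind might look like the sketch below. The URN, the
URLs, and the dictionary standing in for the semantic header database are all
hypothetical; the only property taken from the text is that a URN maps to at most one
header, which in turn lists one or more URLs.

```python
from typing import Optional

# Hypothetical stand-in for the semantic header database, keyed by URN.
headers_by_urn = {
    "urn:example:cs-report-42": {
        "title": "Example report",
        "urls": [
            "http://mirror-a.example/report42",
            "http://mirror-b.example/report42",
        ],
    },
}


def resolve(urn: str) -> Optional[str]:
    """Return one URL for the URN, or None if no header is registered."""
    header = headers_by_urn.get(urn)
    if header is None or not header["urls"]:
        return None
    return header["urls"][0]


found = resolve("urn:example:cs-report-42")
missing = resolve("urn:example:not-registered")
```

Here the first listed URL is returned; a real resolver could apply any other policy for
choosing among the URLs in the header.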


The index entry that is registered is communicated to a database described below.


The Semantic Header Distributed Database System


The index entries registered by a provider of a resource are stored in a distributed
database system (SHDDB). From the point of view of the users of the system, the
underlying Semantic Header database may be considered to be a monolithic system. In
reality, it would be distributed and replicated, allowing for reliable and
failure-tolerant operations. The interface hides the distributed and replicated nature
of the database. The distribution is based on subject areas and as such the database is
considered to be horizontally partitioned [DESAS].


It is envisaged that the database on different subjects will be maintained at different 
nodes of the Internet. The locations of such nodes need only be known by the intrinsic 
interface. A database catalog would be used to distribute this information. However, 
this catalog itself could be distributed and replicated as is done for distributed 
database systems. 


The Semantic Header information entered by the provider of the resource using a 
graphical interface is relayed from the user’s workstation by a client process to the 
database server process at one of the nodes of the SHDDB. The node is chosen based 
on its proximity to the workstation or on the subject of the index record. On receipt of 
the information, the server verifies the correctness and authenticity of the information
and, on finding everything in order, sends an acknowledgment to the client.


The server node is responsible for locating the partitions of the SHDDB where the
entry should be stored and for forwarding the replicated information to the appropriate nodes.
For example, the semantic header entry of Figure 3 would be part of the SHDDB for 
subjects Computer Science and Library Studies. 
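

The routing step can be sketched as follows, assuming a catalog that maps each subject
area to the nodes holding replicas of that partition. The node names and the catalog
layout are hypothetical; only the two subjects are taken from the example above.

```python
from collections import defaultdict

# Hypothetical catalog: subject area -> nodes holding replicas of that partition.
catalog = {
    "Computer Science": ["node-cs-1", "node-cs-2"],
    "Library Studies": ["node-lib-1"],
}

# node -> entries stored there (stand-in for the partitions of the SHDDB).
partitions = defaultdict(list)


def store_entry(entry):
    """Forward the entry to every node responsible for one of its subjects."""
    targets = []
    for subject in entry["subjects"]:
        for node in catalog.get(subject, []):
            if node not in targets:  # avoid storing twice on a shared node
                targets.append(node)
    for node in targets:
        partitions[node].append(entry)
    return targets


entry = {
    "title": "Example header",
    "subjects": ["Computer Science", "Library Studies"],
}
stored_at = store_entry(entry)
```

An entry with two subjects thus lands in both subject partitions, and in every replica
of each.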


Similarly, the database server process is responsible for providing the catalogue
information for the search system. In this way the various sites of the database work
in a cooperating mode to maintain consistency of the replicated portion. The
replicated nature of the database also distributes the load and ensures
continued access to the bibliography when one or more sites are temporarily
nonfunctional. As the SHDDB grows, search performance could be improved by using
techniques used in databases [DESA6].


The Search System 


The design of the search system is modeled on a human
reference librarian. S/he is called on to help in identifying the best sources of
information for a given purpose and to aid in the selection of materials to meet a
particular interest or need. The reference librarian answers these queries
by using information derived from a bibliographic search, processed through the
librarian's own expertise and knowledge of the relevant subject. In addition, users of a
library have access to the same bibliographic indices and many of the information
databases from which they are called on to select relevant titles or weed out irrelevant
ones.


Queries to a reference librarian can be divided into two categories: known and
unknown [KATZ]. In the former, a user asks for an item identified by author, title, or
publication source. In the latter, the need of the user is fuzzy; s/he has no idea of any
of the identifiers of the needed item. Even in the case of known queries, there is
the possibility that the user may have the wrong author, the right author but the wrong
title, wrong dates, or an incorrect volume or issue number for a serial. It may
also happen that even when these are correct, the item is not the one that meets the
need of the user.


A specific search and research type query may require the user to peruse a number of
titles and select from among them. This type of query involves users who have fuzzy
notions of their needs and whose questions are vague. Such queries involve a certain
amount of trial-and-error retrieval and browsing of documents.


One problem that human librarians deal with is the inability of users to ask
the relevant questions. The reference librarian, through a dialog with the user, tries to
narrow down the user’s needs in terms of what and how much information is
required. In many cases the librarian is called upon to match the user's needs with the
sources of information. For example, an article from the popular press may be 
appropriate for a lay person as opposed to one appearing in a prestigious journal 
dedicated to the subject. 


In the search component of the proposed system we plan to incorporate the expertise
used by a reference librarian. This expertise will guide the user in entering the various
search items in a graphical interface similar to the one used by the index entry system
(Figure 5), and will help users formulate and launch queries.
As in the case of the index generation sub-system, the expert system provides help in
choosing appropriate search terms for index entries such as subject, sub-subject,
keywords, etc. To optimize the search, the expert system uses the following type
of statistics for a typical use of a bibliography [KATZ]:


- 70% of queries are by title or by author’s name.

- 50% of queries start with a subject, and these tend to be complex, requiring
  subdivision and refinement.


The search system also uses a graphical interface and a client process. Once the user
has entered a search request, the client process communicates with the nearest
SHDDB catalogue to determine the appropriate site of the SHDDB database.
Subsequently, the client process communicates with this database and retrieves one or
more semantic headers. The results of the query are then collected and sent to the
user’s workstation. The contents of these headers are displayed, on demand, to the user,
who may decide to access one or more of the actual resources using a graphical
window as in Figure 6. The item in question may be available from
a number of sources; in such a case the best source is chosen based on optimum cost.
The client process would attempt to use appropriate hardware/software to retrieve the
selected resources.
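

Choosing among several sources by cost could be as simple as the sketch below. The
source records, URLs, and cost figures are invented for illustration, and "optimum
cost" is interpreted here as lowest monetary cost, which is an assumption; a real
client might also weigh network distance or access restrictions.

```python
def best_source(sources):
    """Pick the lowest-cost source; on a tie, the one listed first wins."""
    if not sources:
        return None
    return min(sources, key=lambda source: source["cost"])


# Hypothetical sources listed in a retrieved semantic header.
sources = [
    {"url": "http://site-a.example/doc", "cost": 2.50},
    {"url": "http://site-b.example/doc", "cost": 0.00},
    {"url": "http://site-c.example/doc", "cost": 1.25},
]
chosen = best_source(sources)
```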

